Final Project

Adam Guenoun, Indira Martinez, Nicholas Solis

Introduction:

In this analysis, we investigate factors correlated with diabetes. Using a data set of 100,000 people, we examine the relationships between age, HbA1c level, smoking history, and glucose level. With such a wide range of data points, we ask whether the trends in this data match our general understanding of diabetes. Our goal is to assess which of the 9 variables play a stronger role in the development of diabetes and whether the trends we find support our assumptions about this data. Through data visualization, chart analysis, and numerical analysis, we present this data to convince a general audience of the important factors that contribute to diabetic trends.

library(tidyverse) ## Loaded for dplyr
library(ggplot2) ## Loaded for plotting
library(plotly) ## Loaded for interactive plots
library(readr) ## Loaded to read in data
library(knitr) ## Loaded to compute and display data
library(scales) ## Loaded to scale data 

Diabetes Dataset

100,000 × 9 (first 6 rows)
gender  age  hypertension  heart_disease  smoking_history  bmi    HbA1c_level  blood_glucose_level  diabetes
Female  80   0             1              never            25.19  6.6          140                  0
Female  54   0             0              No Info          27.32  6.6          80                   0
Male    28   0             0              never            27.32  5.7          158                  0
Female  36   0             0              current          23.45  5.0          155                  0
Male    76   1             1              current          20.14  4.8          155                  0
Female  20   0             0              never            27.32  6.6          85                   0

Male vs. Female Blood Sugar Levels (HbA1c) Plot

Similar Prevalence of Prediabetes – The proportion of individuals categorized as having prediabetes (HbA1c 5.7% - 6.4%) is almost identical between males (41.3%) and females (41.4%). This suggests that prediabetes affects both genders at nearly the same rate.

Females Have a Slightly Higher Proportion of Normal Blood Sugar Levels – More females (38.4%) fall into the normal blood sugar category (<5.7%) compared to males (37.1%). This may indicate some slight protective factors or lifestyle differences in this group.

A slightly higher proportion of males (21.6%) than females (20.2%) falls in the diabetic HbA1c range (≥ 6.5%), so there could be gender-related risk factors worth exploring, such as diet, activity levels, or genetic predisposition.

Overall, blood sugar regulation patterns appear fairly balanced between genders, but small differences suggest potential areas for further investigation.

99,982 × 4 (first 5 rows)
gender  diabetes  HbA1c_level  HbA1c_category
Female  0         6.6          Diabetic ≥ 6.5%
Female  0         6.6          Diabetic ≥ 6.5%
Male    0         5.7          Prediabetic 5.7% - 6.4%
Female  0         5.0          Normal < 5.7%
Male    0         4.8          Normal < 5.7%
6 × 4 (first 5 rows)
gender  HbA1c_category           n      percent
Female  Diabetic ≥ 6.5%          11835  20.21280
Female  Normal < 5.7%            22492  38.41372
Female  Prediabetic 5.7% - 6.4%  24225  41.37348
Male    Diabetic ≥ 6.5%          8959   21.62443
Male    Normal < 5.7%            15358  37.06976
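
The report does not show the code behind the two tables above, so the following is only a minimal sketch of one way they could be produced. It assumes the HbA1c cutoffs quoted in the text (below 5.7, 5.7 to 6.4, and 6.5 or above) and uses a small stand-in tibble built from the first rows of the dataset. The names `sample_rows`, `categorized`, and `category_share` are hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for the full dataset: gender and HbA1c_level
# copied from the first six rows shown earlier in the report.
sample_rows <- tibble(
  gender      = c("Female", "Female", "Male", "Female", "Male", "Female"),
  HbA1c_level = c(6.6, 6.6, 5.7, 5.0, 4.8, 6.6)
)

# Assign each person to an HbA1c range using the cutoffs from the text.
# case_when() checks the conditions in order, so the first match wins.
categorized <- sample_rows %>%
  mutate(HbA1c_category = case_when(
    HbA1c_level >= 6.5 ~ "Diabetic ≥ 6.5%",
    HbA1c_level >= 5.7 ~ "Prediabetic 5.7% - 6.4%",
    TRUE               ~ "Normal < 5.7%"
  ))

# Count each category within gender, then convert the counts
# to within-gender percentages.
category_share <- categorized %>%
  count(gender, HbA1c_category) %>%
  group_by(gender) %>%
  mutate(percent = n / sum(n) * 100) %>%
  ungroup()

category_share
```

Because the percentages are computed within each gender, they sum to 100 for males and for females separately, which matches how the percent column above is read in the text.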

Age Distribution in Diabetes, Heart Disease, and Hypertension Plot

358 × 5 (first 5 rows)
age  diabetes  heart_disease  hypertension  group
57   1         1              1             Diabetes, H.D, and Hyp.
62   1         1              1             Diabetes, H.D, and Hyp.
62   1         1              1             Diabetes, H.D, and Hyp.
67   1         1              1             Diabetes, H.D, and Hyp.
72   1         1              1             Diabetes, H.D, and Hyp.
81,885 × 5 (first 5 rows)
age  heart_disease  diabetes  hypertension  group
54   0              0         0             Free of Diabetes, H.D, and Hyp.
28   0              0         0             Free of Diabetes, H.D, and Hyp.
36   0              0         0             Free of Diabetes, H.D, and Hyp.
20   0              0         0             Free of Diabetes, H.D, and Hyp.
79   0              0         0             Free of Diabetes, H.D, and Hyp.
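
The two groups above separate people who have all three conditions from people who have none. The sketch below shows one way such a split might be built. It is an assumption about the approach, not the code used for the report, and the tibble `people` is a hypothetical handful of rows put together for illustration.

```r
library(dplyr)

# Hypothetical rows for illustration only; the real data has 100,000 people.
people <- tibble(
  age           = c(80, 54, 28, 76, 57, 62),
  diabetes      = c(0, 0, 0, 0, 1, 1),
  heart_disease = c(1, 0, 0, 1, 1, 1),
  hypertension  = c(0, 0, 0, 1, 1, 1)
)

# Label people who have all three conditions or none of them.
# Anyone with only one or two conditions gets NA and is dropped.
grouped <- people %>%
  mutate(group = case_when(
    diabetes == 1 & heart_disease == 1 & hypertension == 1 ~ "Diabetes, H.D, and Hyp.",
    diabetes == 0 & heart_disease == 0 & hypertension == 0 ~ "Free of Diabetes, H.D, and Hyp.",
    TRUE ~ NA_character_
  )) %>%
  filter(!is.na(group))

# Summarise the age distribution of each group.
age_summary <- grouped %>%
  group_by(group) %>%
  summarise(people = n(), median_age = median(age), .groups = "drop")

age_summary
```

The resulting `grouped` table has the same shape as the two tables above and could be passed to a histogram or density plot of `age` filled by `group`.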

BMI Distribution by Hypertension Status Plot

This plot shows the distribution of BMI values by hypertension status. A violin plot works well here because it shows both the spread and the density of BMI within each hypertension category.

Shape and width: The width of each “violin” represents the density of BMI values at that level. Wider sections mean more individuals have that BMI, while narrower sections mean fewer.

Comparison of distributions: The blue violin represents people without hypertension (hypertension = 0), and the red violin represents those with hypertension (hypertension = 1). Comparing them shows how BMI differs between the groups.

The horizontal line around 25 BMI: This marks the median BMI for each group. Because the line sits in roughly the same position in both violins, the median BMI is around 25 for both hypertensive and non-hypertensive individuals.

Density trends: Where the violins differ in thickness, the corresponding BMI values are more common in one group than in the other. People with hypertension appear to have more density at higher BMI values, even though both groups share a similar median.

Distribution shape: A violin that is wider at higher BMI values indicates that a larger share of that group has a high BMI. Here, that pattern suggests that hypertension is more common among individuals with higher BMI.

Outliers or extreme values appear as small bulges or extended tails at the ends of the violins, showing individuals with very high or very low BMI.
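
As a rough illustration of the plot described above, here is a minimal sketch of a violin plot with a median marker. The BMI values are simulated purely so that the example runs on its own, and the names `bmi_data`, `bmi_medians`, and `violin_plot` are hypothetical. The report's actual plotting code may differ.

```r
library(ggplot2)

# Simulated BMI values, for illustration only (not the report's data).
set.seed(1)
bmi_data <- data.frame(
  hypertension = factor(rep(c(0, 1), each = 200)),
  bmi = c(rnorm(200, mean = 26, sd = 5), rnorm(200, mean = 29, sd = 5))
)

# Median BMI per group: the value a horizontal median marker represents.
bmi_medians <- tapply(bmi_data$bmi, bmi_data$hypertension, median)

violin_plot <- ggplot(bmi_data, aes(x = hypertension, y = bmi, fill = hypertension)) +
  # The width of each violin shows the density of BMI values at that level.
  geom_violin() +
  # Draw a short horizontal bar at each group's median.
  stat_summary(fun = median, fun.min = median, fun.max = median,
               geom = "crossbar", width = 0.5) +
  labs(x = "Hypertension (0 = no, 1 = yes)", y = "BMI", fill = "Hypertension")

violin_plot
```

Computing `bmi_medians` separately makes it possible to state the medians in the text rather than reading them off the plot.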

10,000 × 9 (first 5 rows)
gender  age  hypertension  heart_disease  smoking_history  bmi    HbA1c_level  blood_glucose_level  diabetes
Female  80   0             1              never            25.19  6.6          140                  0
Female  54   0             0              No Info          27.32  6.6          80                   0
Male    28   0             0              never            27.32  5.7          158                  0
Female  36   0             0              current          23.45  5.0          155                  0
Male    76   1             1              current          20.14  4.8          155                  0

The graph below is split by whether or not a person has hypertension and compares BMI between the two groups. The majority of people both with and without hypertension fall within a BMI range of 25-29. Notice that for people with hypertension, the density above the red line is greater than for people without hypertension, indicating that a larger share of people with hypertension have a higher BMI.

96,713 × 3 (first 5 rows)
age  diabetes     blood_glucose_level
80   No Diabetes  140
54   No Diabetes  80
28   No Diabetes  158
36   No Diabetes  155
76   No Diabetes  155

BMI vs. Age Across Diabetes & Heart Disease Plot

8,500 × 4 (first 5 rows)
age  bmi    diabetes  condition
44   19.31  1         Diabetes Only
67   27.32  1         Diabetes Only
50   27.32  1         Diabetes Only
73   25.91  1         Diabetes Only
53   27.32  1         Diabetes Only
3,942 × 4 (first 5 rows)
age  bmi    heart_disease  condition
80   25.19  1              Heart Disease Only
76   20.14  1              Heart Disease Only
72   27.94  1              Heart Disease Only
67   27.32  1              Heart Disease Only
77   32.02  1              Heart Disease Only

A relation to BMI and Heart Disease

Every person in this plot has heart disease. The plot compares people classified as underweight and overweight on the BMI scale, grouped by sex. Among those with heart disease, a noticeably larger percentage are overweight. The plot suggests that the likelihood of heart disease rises as weight increases.

An exception?

The data here depends heavily on the BMI scale. BMI alone is not a reliable indicator of diabetes, but the data shows a general trend: people with a BMI over 30 are more likely to be diabetic.
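
One way to check a claim like this is to compare the diabetes rate on either side of the BMI 30 cutoff. The sketch below does that on a few made-up rows. The values in `toy` are hypothetical and do not reflect the real dataset, so only the method carries over, not the numbers.

```r
library(dplyr)

# Made-up rows for illustration only; they are not taken from the dataset.
toy <- tibble(
  bmi      = c(19.3, 27.3, 27.3, 25.9, 31.5, 34.0, 36.2, 22.1),
  diabetes = c(0, 0, 1, 0, 1, 1, 0, 0)
)

# Split people at the BMI 30 cutoff and compute the share with diabetes.
# Because diabetes is coded 0/1, its mean is the proportion of diabetics.
rate_by_bmi <- toy %>%
  mutate(bmi_group = if_else(bmi > 30, "BMI over 30", "BMI 30 or under")) %>%
  group_by(bmi_group) %>%
  summarise(people = n(), diabetes_rate = mean(diabetes) * 100, .groups = "drop")

rate_by_bmi
```

Run against the full dataset, the same summary would give the two rates needed to support or qualify the trend described above.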

The plot below shows the population density of men based on diabetes status, scaled by age range.

The plot below shows the population density of women based on diabetes status, scaled by age range.

Smokers go brrr

In the smoking data there are 6 unique values:

  1. Never: Has never smoked
  2. Not current: Has smoked but is not currently smoking
  3. Former: Has quit smoking (abstained for longer than)
  4. Current: Is currently a smoker
  5. Ever: Has ever smoked regardless of current smoking status
  6. No Info: No smoking history information available

The total number of people who fall into each category is as follows:

  1. Never: 35095
  2. Not current: 6447
  3. Former: 9352
  4. Current: 9286
  5. Ever: 4004
  6. No Info: 35816

A sizable share of people (35,816 of 100,000, about 36%) fall into the No Info category.

The total number of people in the dataset is 100,000. To help clean up the data, we can filter the ‘No Info’ people out. Doing so leaves 64,184 people.
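
The figures above can be checked directly from the category counts. This is a small sketch in base R using the counts listed in this section; the variable names are hypothetical.

```r
# Category counts as listed above.
smoking_counts <- c(
  "never" = 35095, "not current" = 6447, "former" = 9352,
  "current" = 9286, "ever" = 4004, "No Info" = 35816
)

# Everyone in the dataset.
total_people <- sum(smoking_counts)

# People left after dropping the 'No Info' category.
known_history <- total_people - smoking_counts[["No Info"]]

# Share of the dataset with no smoking information, in percent.
no_info_share <- smoking_counts[["No Info"]] / total_people * 100

total_people   # 1e+05
known_history  # 64184
no_info_share  # 35.816
```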

# Figure out the unique categories of smoking history
unique(diabetes_dataset$smoking_history)
## [1] "never"       "No Info"     "current"     "former"      "ever"       
## [6] "not current"
# Count the number of people in each smoking category,
# omitting 'No Info'

smoking_tally <- diabetes_dataset %>%
  filter(smoking_history != 'No Info') %>%
  group_by(smoking_history) %>%
  summarise(total_people = n())

#group diabetic vs non diabetic people together

smoking_diabetes_dataset <- diabetes_dataset %>%
  filter(smoking_history != 'No Info') %>%  
  group_by(smoking_history, diabetes) %>%  
  summarise(total = n())
## `summarise()` has grouped output by 'smoking_history'. You can override using
## the `.groups` argument.
smoking_diabetes_dataset
## # A tibble: 10 × 3
## # Groups:   smoking_history [5]
##    smoking_history diabetes total
##    <chr>              <dbl> <int>
##  1 current                0  8338
##  2 current                1   948
##  3 ever                   0  3532
##  4 ever                   1   472
##  5 former                 0  7762
##  6 former                 1  1590
##  7 never                  0 31749
##  8 never                  1  3346
##  9 not current            0  5757
## 10 not current            1   690
# Inner join the tally data with the diabetes-grouped data,
# mutate a column with the percentage per category,
# and select the desired columns

smoking_diabetes_percentage <- inner_join(smoking_tally, smoking_diabetes_dataset, by = 'smoking_history') %>%
  mutate(Percentage = total / total_people * 100) %>%
  select(smoking_history, diabetes, total, Percentage)

smoking_diabetes_percentage
## # A tibble: 10 × 4
##    smoking_history diabetes total Percentage
##    <chr>              <dbl> <int>      <dbl>
##  1 current                0  8338      89.8 
##  2 current                1   948      10.2 
##  3 ever                   0  3532      88.2 
##  4 ever                   1   472      11.8 
##  5 former                 0  7762      83.0 
##  6 former                 1  1590      17.0 
##  7 never                  0 31749      90.5 
##  8 never                  1  3346       9.53
##  9 not current            0  5757      89.3 
## 10 not current            1   690      10.7
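
As a possible alternative to the join above, the same percentages can be computed in one step by dividing each count by its group total. This is a sketch under the assumption that the grouped counts look like the output shown earlier. The tibble `smoking_diabetes_counts` re-enters those counts by hand so that the example runs on its own, and `percentage_by_group` is a hypothetical name.

```r
library(dplyr)

# Counts copied from the grouped output shown earlier in this section.
smoking_diabetes_counts <- tibble(
  smoking_history = rep(c("current", "ever", "former", "never", "not current"), each = 2),
  diabetes        = rep(c(0, 1), times = 5),
  total           = c(8338, 948, 3532, 472, 7762, 1590, 31749, 3346, 5757, 690)
)

# Within each smoking category, divide each count by the category total.
# This avoids building a separate tally table and joining it back.
percentage_by_group <- smoking_diabetes_counts %>%
  group_by(smoking_history) %>%
  mutate(Percentage = total / sum(total) * 100) %>%
  ungroup()

percentage_by_group
```

Both approaches give the same numbers; the grouped `mutate()` is just shorter when the tally is not needed for anything else.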

Now we can graph the relationship between smoking and diabetes as separated by smoking category.

library(ggplot2)
library(plotly)

#Create initial graph about smoking and diabetes

smoking_diabetes_graph <- ggplot(smoking_diabetes_percentage) + 
  geom_col(aes(x = smoking_history, y = total, fill = as.factor(diabetes)), position = 'dodge') 

smoking_diabetes_graph <- smoking_diabetes_graph + coord_flip() + labs(
  title = 'Smoking History and Diabetes Relationship',
  y = 'People',
  x = 'Smoking History',
  fill = 'Has Diabetes',
  caption = "1 indicates diabetes, 0 indicates no diabetes"
  ) 

smoking_diabetes_graph

ggplot(smoking_diabetes_percentage) + geom_col(aes(x = smoking_history, y = Percentage, fill = as.factor(diabetes)))
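
Since the two bars in each category always sum to 100%, it may be easier to compare categories by plotting only the diabetic share. The following is a hedged sketch of that idea, not part of the original analysis. The data frame `diabetic_share` re-enters the counts from the tables above, and the object names are hypothetical.

```r
library(ggplot2)

# Diabetic counts and category totals taken from the tables above.
diabetic_share <- data.frame(
  smoking_history = c("current", "ever", "former", "never", "not current"),
  diabetic        = c(948, 472, 1590, 3346, 690),
  total_people    = c(9286, 4004, 9352, 35095, 6447)
)

# Percentage of each smoking category that has diabetes.
diabetic_share$Percentage <- diabetic_share$diabetic / diabetic_share$total_people * 100

# Order the bars by percentage so the ranking is visible at a glance.
diabetic_share_graph <- ggplot(diabetic_share) +
  geom_col(aes(x = reorder(smoking_history, Percentage), y = Percentage)) +
  coord_flip() +
  labs(
    title = 'Share of People with Diabetes by Smoking History',
    x = 'Smoking History',
    y = 'People with Diabetes (%)'
  )

diabetic_share_graph
```

In this view the former smokers stand out with the highest diabetic share (about 17%), while people who never smoked have the lowest (about 9.5%).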